Trans‐ethnic polygenic risk scores for body mass index: An international hundred K+ cohorts consortium study

Abstract

Background: While polygenic risk scores hold significant promise in estimating an individual's risk of developing a complex trait such as obesity, their application in the clinic has, to date, been limited by a lack of data from non-European populations. As a collaboration model of the International Hundred K+ Cohorts Consortium (IHCC), we endeavored to develop a globally applicable trans-ethnic polygenic risk score (PRS) for body mass index (BMI) through this relatively new international effort.

Methods: The PRS model was developed, trained and tested at the Center for Applied Genomics (CAG) of The Children's Hospital of Philadelphia (CHOP) based on a BMI meta-analysis from the GIANT consortium. The validated PRS models were subsequently disseminated to the participating sites. Scores were generated by each site locally on their cohorts, and summary statistics were returned to CAG for final analysis.

Results: We show that, in the absence of a well-powered trans-ethnic GWAS from which to derive marker SNPs and effect estimates for PRS, trans-ethnic scores can be generated from European ancestry GWAS using Bayesian approaches such as LDpred, by adjusting the summary statistics using trans-ethnic linkage disequilibrium (LD) reference panels. The ported trans-ethnic scores outperform population-specific PRS across all non-European ancestry populations investigated, including East Asians and a three-way admixed Brazilian cohort.

Conclusions: For a truly polygenic trait such as BMI, adjusting the summary statistics of a well-powered European ancestry study using a trans-ethnic LD reference results in a score that is predictive across a range of ancestries, including East Asians and three-way admixed Brazilians.


INTRODUCTION
Obesity is a global health issue, 1 with an adult prevalence of about 13% across the world (https://www.who.int/newsroom/fact-sheets/detail/obesity-and-overweight). Because obesity is difficult to treat, prevention is important, which has led us to develop a trans-ethnic polygenic risk score (PRS) for body mass index (BMI) through the International HundredK+ Cohorts Consortium (IHCC). 2 A PRS aggregates the effects of many genetic variants across the human genome into a single score, which may improve the prediction of a complex disease or trait and assist differential diagnosis. 3 BMI-associated variants have been under natural selection and explain 30−40% of the variance in BMI. 4 A previous study has shown that PRS is an important determinant of BMI across life. 5 In addition to obesity prevention, a PRS for BMI may have clinical applications, for example, the prediction of cardiometabolic health. 6 Currently, as with other health outcomes, there is a serious lack of genomic information for BMI in minority populations. Developing a PRS for BMI in minorities warrants research effort to avoid health disparities.
Our obesity PRS was based on the published GWAS meta-analysis of BMI that included 339 224 individuals of European ancestry. The study identified 97 genome-wide significant BMI-associated loci that alone account for approximately 2.7% of BMI variation. Genome-wide estimates suggest that common variation accounts for over 20% of variation in BMI. Various approaches for PRS calculation have been developed to date. [7][8][9] The standard approaches for calculating risk scores involve linkage disequilibrium (LD)-based marker pruning followed by p-value thresholding of GWAS-based association statistics. While effective, these approaches lose information and can reduce predictive accuracy, particularly where the test population differs in genetic ancestry from the GWAS sample. Bayesian approaches such as LDpred, 7 a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel, may therefore improve prediction accuracy in multi-ethnic studies of diverse populations. 7 As most large-scale GWAS have been conducted using only individuals of European ancestry, there is a need to develop approaches that can port PRS using European ancestry-derived effect estimates. More importantly, differences in BMI and obesity prevalence across human populations are closely related to environmental factors and cultural diversity. 10,11 Therefore, it is essential to validate a multi-ethnic PRS in different regional populations, especially admixed populations. Leveraging an international effort supported by the IHCC that has brought together large-scale cohorts with genotyping data from around the world, we explored the development of an LDpred-based, trans-ethnic (TE) obesity PRS through a collaboration of 6 international research centers (Figure 1).
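To make the idea of a posterior mean effect size concrete, the following minimal Python sketch covers only the simplest case: a single marker with no LD, under a point-normal prior. With probability p the marker is causal with an effect drawn from N(0, h²/(Mp)); otherwise its effect is zero, and the marginal GWAS estimate is assumed to have sampling variance 1/N. The function names and the example numbers are illustrative assumptions. LDpred itself additionally accounts for LD through a reference panel and Gibbs sampling, which is not reproduced here.

```python
import math


def normal_pdf(x, variance):
    """Density of a zero-mean normal distribution with the given variance."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def posterior_mean_effect(beta_hat, n_samples, n_markers, h2, p_causal):
    """Posterior mean of one standardized effect, ignoring LD (illustrative only)."""
    prior_var = h2 / (n_markers * p_causal)  # effect variance if the marker is causal
    noise_var = 1.0 / n_samples              # assumed sampling variance of the estimate
    # Marginal likelihood of the observed estimate under each mixture component.
    like_causal = normal_pdf(beta_hat, prior_var + noise_var)
    like_null = normal_pdf(beta_hat, noise_var)
    # Posterior probability that the marker is causal.
    post_causal = p_causal * like_causal / (
        p_causal * like_causal + (1.0 - p_causal) * like_null
    )
    # Shrinkage applied to the estimate when the marker is causal.
    shrink = prior_var / (prior_var + noise_var)
    return post_causal * shrink * beta_hat


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the study.
    for beta_hat in (0.001, 0.005, 0.02):
        weight = posterior_mean_effect(
            beta_hat, n_samples=300_000, n_markers=1_000_000, h2=0.2, p_causal=0.03
        )
        print(f"beta_hat={beta_hat:.3f} -> weight={weight:.6f}")
```

In this simplified setting, small estimates are shrunk strongly towards zero because they are more likely to come from the null component, while large estimates retain most of their magnitude. Setting `p_causal` to 1 reduces the calculation to uniform shrinkage of every marker.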

Model training at the Children's Hospital of Philadelphia (CHOP)
The PRS model development and training were carried out at the Center for Applied Genomics (CAG) according to the pipeline shown in Figure 1. SNP weights (i.e. posterior mean effect sizes) were calculated using Markov chain Monte Carlo (MCMC) Gibbs sampling as implemented in LDpred. 7 The summary statistics of genetic association with BMI were based on the meta-analysis of genome-wide association studies by the GIANT consortium. 12 We restricted the variants to SNPs included in the HapMap3 data. Five sets of SNP weights were generated based on the respective LD patterns from the following populations: African American (AA), Hispanic American (HAMR), East Asian (ASN), Northern European (EUR) and all of the above populations combined (i.e. trans-ethnic). For each group we selected 2500 CAG participants who clustered with the 1000 Genomes Project reference data 13

Initial validation of PRS models in-house
The initial assessment of the PRS models was based on the in-house genomic data at CAG, CHOP. The pediatric biobank built at CAG has archived samples from 500 000 children from the USA, Europe, South America, Canada, Saudi Arabia and Australia. 14 The definition of normal BMI for children is age- and sex-specific. 15 Instead of developing a TE PRS for adult BMI, this study aimed for a binary TE PRS model for extreme BMI, that is, the top 1% and 5% of BMI. These percentiles are in line with the clinical definition of childhood obesity, that is, childhood overweight as BMI ≥95th percentile, 16 and severe obesity as BMI ≥99th percentile. 17 For the initial validation of the trans-ethnic PRS model, we used 57 613 randomly selected individuals (51% males and 49% females) with genotypes and BMI data from the CAG biobank, including, in order of frequency, individuals of European ancestry, African Americans, Hispanics/Latinos, and East/South Asians. The principal component analysis (PCA) plot of the population structure is shown in Figure 1. All individuals were genotyped with an Illumina Genotyping BeadChip with at least 550 000 SNPs genotyped. Genome-wide imputation was done using the TOPMed Imputation Server. Harmonization of SNP alleles in the PRS model was confirmed by comparing with the reference alleles of the TOPMed imputation.

Validation of PRS models in regional populations
Having validated the trans-ethnic PRS models, we shared the protocol of the PRS models, as well as the SNPs and weights with the IHCC collaborators for assessment in 7 different cohorts of regional populations. The population sites included a three-way admixed Brazilian cohort from ELSA-Brasil, 18 (Figure 1).
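As a minimal illustration of how shared SNP weights become a per-person score, the sketch below computes a PRS as the weighted sum of effect-allele dosages. The data layout (dictionaries keyed by SNP identifier), the function name, the SNP identifiers, and the rule of skipping SNPs that are absent for a person are all assumptions made for illustration; they are not the consortium protocol.

```python
def polygenic_score(weights, dosages):
    """Sum weight * effect-allele dosage over SNPs present in both inputs.

    weights: dict mapping SNP id -> posterior mean effect size (the shared weight)
    dosages: dict mapping SNP id -> effect-allele dosage in [0, 2] for one person
    Returns the score and the number of SNPs that contributed to it.
    """
    score = 0.0
    n_used = 0
    for snp, weight in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:  # SNP not genotyped or imputed for this person
            continue
        score += weight * dosage
        n_used += 1
    return score, n_used


if __name__ == "__main__":
    # Hypothetical SNP ids and weights.
    weights = {"rs_a": 0.012, "rs_b": -0.004, "rs_c": 0.007}
    person = {"rs_a": 2.0, "rs_b": 1.0}  # rs_c is missing for this person
    score, n_used = polygenic_score(weights, person)
    print(f"score={score:.3f} from {n_used} SNPs")
```

Returning the number of contributing SNPs alongside the score makes it easier to spot cohorts in which many weighted SNPs are missing, which would make scores less comparable between sites. The dosage must count the same effect allele that the weight refers to, which is why allele harmonization matters before scoring.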
Each site generated the PRS on their cohort following the same protocol and using the same tools and software packages. To compare the performance of the trans-ethnic PRS with that of population-specific PRS in different regional populations, we requested that each collaborating group run two calculations if possible: one PRS calculation for the specific population closest to their dataset, and one for the trans-ethnic score. For the dichotomous receiver operating characteristic (ROC) curve analysis, the top 5% and 1% of the BMI distribution within each study were defined as cases.
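The case definition and ROC analysis described above can be sketched in a few lines of Python. This is an illustrative sketch, not the software used by the sites: the function names, the rounding rule for the number of cases, and the example values are assumptions. The AUC is computed through the rank-sum (Mann-Whitney) identity, with tied scores given their average rank.

```python
def label_top_fraction(bmi_values, fraction):
    """Return 1 for individuals in the top `fraction` of BMI within a study, else 0."""
    n_cases = max(1, int(round(len(bmi_values) * fraction)))
    # Indices sorted from highest to lowest BMI; ties are broken by input order.
    order = sorted(range(len(bmi_values)), key=lambda i: bmi_values[i], reverse=True)
    labels = [0] * len(bmi_values)
    for i in order[:n_cases]:
        labels[i] = 1
    return labels


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    pairs = sorted(zip(scores, labels))
    rank_sum_cases = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_cases += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum_cases - n_cases * (n_cases + 1) / 2.0
    return u_stat / (n_cases * n_controls)


if __name__ == "__main__":
    # Hypothetical BMI values and scores for ten people.
    bmi = [20, 35, 25, 40, 22, 30, 28, 24, 26, 21]
    prs = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.5, 0.6, 0.0]
    cases = label_top_fraction(bmi, 0.2)
    print("cases:", cases)
    print("AUC:", roc_auc(cases, prs))
```

Because cases are defined within each study, the threshold BMI differs between cohorts, so an AUC describes how well the score ranks individuals inside one cohort rather than against a common absolute cut-off.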

Participating IHCC cohorts
The Nurses' Health Study I recruited 121 700 married registered nurses in 1976. Blood samples were collected from 33 000 participants in 1989−90, and cheek cells from another 33 000 in 2001−4. Genome-wide association data are available on over 17 000 participants as part of studies of multiple complex diseases and traits, including breast cancer, type 2 diabetes, venous thromboembolism, and depression. All women included in these analyses are of European ancestry (cluster with European reference samples and do not self-report as other than European ancestry). The Nurses' Health Study II recruited 116 430 married registered nurses between the ages of 25 and 42 in 1989. Blood samples were collected from 29 000 participants in 1996−99, and cheek cells from another 30 000 in 2006. Genome-wide association data are available on over 12 000 participants as part of studies of multiple complex diseases and traits, including breast cancer, type 2 diabetes, venous thromboembolism, and depression. All women included in these analyses are of European ancestry (cluster with European reference samples and do not self-report as other than European ancestry).
Body mass index was self-reported at the time of blood draw or cheek cell collection. Diabetes cases were defined as self-reported diabetes confirmed by a validated supplementary questionnaire.
The Shanghai Women's Health Study (SWHS) is a large population-based prospective cohort study initiated in 1996. 20 Approximately 75 000 Chinese women who lived in Shanghai were recruited into the study. In addition to survey data, blood and urine samples were collected from most study participants at the baseline recruitment.
The Shanghai Men's Health Study (SMHS) is a population-based cohort study of 61 480 Chinese men between ages 40 and 74 who lived in eight urban communities in Shanghai at enrollment (2002−2006). 22 Detailed information on dietary and other lifestyle factors was collected at baseline and updated in follow-up surveys. Biological samples (blood and/or urine) were collected from 89% of cohort members.
The Norwegian Mother, Father and Child Cohort Study (MoBa) was established and is conducted by the Norwegian Institute of Public Health (NIPH). MoBa is an ambitious family-oriented cohort study that aims to find causes of diseases and explain trajectories and variability of health-related traits over the life course. Between 1999 and 2008, pregnant women were invited to take part in the study around the time of the ultrasound examination in week 17−20 of gestation. The fathers of the children were also invited to participate. Biological material has been collected from mothers, fathers and children and has been stored in a biobank. Self-reported data are collected from regular questionnaires about general health, diet and environmental exposure. The cohort includes approximately 109 000 children, 91 000 women and 71 700 men. In this study, 50 290 Northern European adult males and females were analysed, with mean BMI = 24.96 (SD = 3.9).
The Brazilian Longitudinal Study of Adult Health (ELSA-Brasil) enrolled 15 105 civil servants aged 35 to 74 years living in six cities, 18 addressing the incidence of non-communicable diseases. Of the 15 105 participants, 9333 DNA samples were analyzed for genetic ancestry using a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. 19 The

Initial validation of PRS models in-house
The performance of the trans-ethnic PRS model tested in-house is shown in Figure 2. An AUC > 0.720 for predicting the top 1% of BMI was achieved in all the PRS models with mixture probability ≥0.03. The highest AUC was seen at mixture probabilities of 0.03-0.1.

Evaluation of the PRS in the four European ancestry cohorts
The variance explained by the PRS in European-ancestry individuals was 0.095 for the EUR PRS and 0.092 for the TE PRS. Across all four European ancestry cohorts, both the European-specific PRS models and the trans-ethnic PRS models had the largest AUCs at a mixture probability of 0.03. Across the three cohorts where both the European-specific PRS model and the trans-ethnic PRS model were run (INTERVAL, NHS and NHSII), the performance of the scores was comparable (Table 1).
Only the European-specific PRS models were tested in the Norwegian MoBa study. The PRS models showed AUC ≥0.686 for predicting the top 5% of BMI and AUC ≥0.735 for predicting the top 1% of BMI at mixture probabilities ≥0.03 (Table 2).
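One common way to express the variance in a continuous trait explained by a score is the squared Pearson correlation between the two. The text does not state how the variance-explained figures for the European ancestry cohorts were computed (for example, whether covariates were adjusted for first), so the sketch below is an assumption about the general technique rather than a description of the study's calculation. The function name and example values are hypothetical.

```python
import math


def variance_explained(scores, trait):
    """Squared Pearson correlation between a score and a continuous trait."""
    n = len(scores)
    if n != len(trait) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_s = sum(scores) / n
    mean_t = sum(trait) / n
    # Sums of squares and cross-products around the means.
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, trait))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in trait)
    if var_s == 0.0 or var_t == 0.0:
        raise ValueError("inputs must not be constant")
    r = cov / math.sqrt(var_s * var_t)
    return r * r


if __name__ == "__main__":
    # Hypothetical scores and BMI values.
    prs = [0.1, 0.4, 0.2, 0.9, 0.6]
    bmi = [22.0, 27.5, 24.0, 29.0, 25.5]
    print(f"variance explained: {variance_explained(prs, bmi):.3f}")
```

Unlike AUC on a dichotomized outcome, this measure uses the full BMI distribution, so the two metrics can rank models differently.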

Validation of the trans-ethnic PRS models in regional populations
In the Chinese population, the PRS models were tested in 866 men in the SMHS cohort and 3120 women in the SWHS cohort (Table 3). The Brazilian population in the ELSA-Brasil cohort is three-way admixed. We therefore compared the trans-ethnic score in this population to the scores specific to each of the three founder populations: American, African and European. The trans-ethnic score outperformed all three founder population ancestry-specific scores at the most informative fraction of 0.03 at both the 95th and 99th percentiles (Table 4).

FIGURE 2 AUC values following the initial validation of PRS models in-house. 57 613 randomly selected individuals with genotypes and BMI data from the Center for Applied Genomics biobank at the Children's Hospital of Philadelphia were used for the validation.

DISCUSSION
Clinically, the trans-ethnic PRS may assist the prediction of dynamic BMI for primary prevention of overweight. A prediction model with AUC = 0.845 developed by Welten et al. 24 took into account predictors including maternal BMI, paternal BMI, birthweight, sex, and a number of socioeconomic and environmental factors. The PRS can effectively address the imprecision of the inheritance information represented by maternal and paternal BMI, as each parental allele has a 50% chance of being inherited. For polygenic risk scores to achieve their clinical potential, and to avoid exacerbating health disparities due to the lack of genomic information in minorities, they have to be universally applicable regardless of a patient's genetic ancestry. Ideally, trans-ethnic PRS would be generated from true trans-ethnic GWAS; however, well-powered trans-ethnic studies remain the exception for the majority of traits and phenotypes. In this study we evaluated the performance of a BMI PRS that is based on European ancestry GWAS effect sizes combined with trans-ethnic LD patterns using a Bayesian approach as implemented in LDpred. 7 In a federated model, we developed the PRS at CAG and disseminated the standard operating procedure, along with the SNP and weight files and the population-specific LD matrices, to participating sites within the IHCC. This model allowed us to quickly test the hypothesis in various world populations without the need for data transfer and hence time-consuming data sharing agreements. By providing detailed protocols and all required files to run the scores, we sought to minimize the workload on each collaborating group while enabling direct comparisons between groups. LDpred has been shown to perform comparably in BMI PRS to pruning and thresholding (P + T) approaches, for example, PRSice-2. 25 In particular, LDpred applies the LD information from a population-specific reference panel. For the purpose of testing a TE PRS, LDpred allows the comparison of the TE score with the score derived from a population-specific reference panel.
In the TE PRS model, gene enrichment analysis using the WEB-based Gene Set Analysis Toolkit (WebGestalt) [7] based on the Molecular Signatures Database (MSigDB) hallmark gene set collection 26 showed that the SNPs with |β| ≥2.0E-04 are enriched in the gene sets HALLMARK_UV_RESPONSE_DN and HALLMARK_ESTROGEN_RESPONSE_EARLY (false discovery rate < 0.05, Data, Supporting Information). It is worth mentioning that a new version of the method, LDpred2, has been released since the LDpred 7 used in this study. 27 LDpred2 made a significant effort to address the potential bias of Gibbs sampling in the human leukocyte antigen (HLA) region on chromosome 6, which has extended LD. 27 In contrast, the HLA region was removed in the LDpred modeling. The HLA region is highly polymorphic, with highly diverse frequencies across different populations, as well as extended and strong LD due to significant evolutionary selection pressure in human populations. 28,29 Including the HLA region would cause significant differences in PRS across ethnicities, as we have observed in the trans-ethnic scoring of an autoimmune disease, type 1 diabetes (T1D), with HLA as a major risk factor. 30 In addition, no GWAS to date has suggested a role for the HLA region in BMI or obesity. Nevertheless, it would be interesting to leverage the IHCC resources to examine the performance of LDpred2 in the trans-ethnic scoring of BMI.
The performance of the trans-ethnic PRS in different cohorts is comparable to that of published PRS models for the prediction of obesity in populations of European ancestry, for example, AUC = 0.708 in the European-ancestry participants of the UK Biobank 31 and AUC = 0.619 to 0.704 in the Quebec Family Study. 32 In our study, LDpred outperformed LDpred-inf at the higher fractions of causal markers (1, 0.3, 0.1, 0.03), but not at the lower fractions (0.01, 0.003, and 0.001). As the results show, our trans-ethnic PRS models outperformed the ancestry-specific models in the non-European populations tested, including Chinese and Brazilian. Importantly, the UK and US data from the INTERVAL and NHS cohorts demonstrate that there is no appreciable loss in predictive power in European ancestry individuals when using a trans-ethnic score. As such, and in the absence of trans-ethnic effect sizes from diverse GWAS, we propose that using trans-ethnic reference data to adjust the summary statistics for the effects of LD patterns improves the performance of PRS in populations that were not included in the generation of the summary statistics.
On the other hand, in the absence of reference panels for different populations, the genome-wide LD scores and matrices from the Pan-UK Biobank resource, or those calculated from genome sequencing data of the Genome Aggregation Database (gnomAD), may provide sufficient resolution. As the first effort by the IHCC to leverage the existing datasets that reside within this large-scale consortium for a trans-ethnic PRS on BMI, we envision an opportunity to scale this to other cohorts within the consortium and expand the number of traits that can be analyzed. As such, the IHCC presents a rich resource of data for collaborative research with a trans-ethnic focus, an area where there is much unmet need at present and that has been largely ignored. In this study, the performance of the trans-ethnic PRS model is relatively poor in the Brazilian cohort. The prevalence of obesity is high in the Brazilian population. 33 In addition to the genetic heterogeneity of the admixed population, the underperformance of the PRS may also be due to unaccounted environmental and socioeconomic factors in this population. 34 The current study is a proof-of-principle cross-network pilot for the IHCC in demonstrating feasibility across a condensed timeline, and needs benchmarking. The strategy presented in the current study was to develop and share a technical protocol easily applicable to different sites. In the meantime, a number of large-scale GWAS have been published in other populations, for example, East Asian populations. [35][36][37] An updated meta-analysis including the GWAS in other populations may improve the current PRS model. However, redoing the meta-analysis will require access to individual-level data to redo the genotype imputation. Addressing data sharing barriers across different international research centers warrants more extensive research efforts.

ACKNOWLEDGEMENTS
We thank all the participants who contributed to and enabled this study. The online supplementary file provides comprehensive details about the funding.

CONFLICT OF INTEREST STATEMENT
A.S.B. declares grants outside of this work from AstraZeneca, Bayer, Biogen, BioMarin, Merck and Sanofi. All other authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

CONSENT FOR PUBLICATION
All authors have provided consent for publication of the manuscript.